Introduction to Hypothesis Testing

DATAX121-23A (HAM) & (SEC) - Introduction to Statistical Methods

Learning Outcomes

  • Understand the concept of a hypothesis test
  • How to construct and interpret a hypothesis test for the population mean
  • Understand the link between a confidence interval for the population mean and a two-sided hypothesis test for the population mean
  • Lies, damn lies, and statistics: On the topic of a p-value

Hypothesis testing

The purpose of a hypothesis test

In T03, we learnt that we could quantify the uncertainty in statistics calculated from a single sample

  • We only formally covered the case for the sample mean, \(\bar{x}\), to infer the value of the population mean, \(\mu\), with a confidence interval

In the many fields that apply statistical methods, practitioners often want to “measure evidence” given that a hypothesis is true

That is, they want a statistical test that uses data to judge whether a statement about the population (or process) from which the data were collected may be true or not

Definition: The null & alternative hypotheses

The null hypothesis, represented by the symbol H0, is a statement that there is “nothing” happening. In most situations, the researcher hopes to disprove or reject the null hypothesis

The alternative hypothesis, represented by the symbol H1 or Ha, is a statement that “something” is happening. In most situations, this hypothesis is what the researcher hopes to prove

— Utts & Heckard (2015)

The structure of these hypothesis statements, logically and mathematically, allows us to examine whether the data provide enough evidence to refute the null, \(H_0\), in support of the alternative, \(H_1\)

From research questions to hypothesis statements

Do tertiary students spend more than half of their weekly income on rent?

Let \(p\) be the underlying proportion of weekly income that tertiary students spend on rent

\(\quad H_0\!: p = 0.5\)

\(\quad H_1\!: p > 0.5\)
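If the data were recorded as a yes/no answer for each student (“spends more than half of their weekly income on rent”), one way such a test could be carried out in R is with `prop.test()`. The counts below are purely illustrative, not from the notes:

```r
# Hypothetical counts (illustrative only): 68 of 120 surveyed students
# answered "yes" to spending more than half their weekly income on rent
x <- 68
n <- 120

# Test H0: p = 0.5 against H1: p > 0.5
res <- prop.test(x, n, p = 0.5, alternative = "greater")
res$p.value
```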

Do people read faster using a printed book or a Kindle or iPad?

Let \(\mu_\text{Book}\), \(\mu_\text{Kindle}\), and \(\mu_\text{iPad}\) be the underlying mean reading speed of people using a printed book, Kindle, and iPad, respectively

\(\quad H_0\!: \mu_\text{Book} = \mu_\text{Kindle} = \mu_\text{iPad}\)

\(\quad H_1\!: \text{at least one} ~ \mu_i \ne \mu_j\)

Measuring evidence against the null

One way to understand this mechanism is to consider the sampling distribution of a statistic by defining what we “hypothesise” the parameter to be

This allows us to answer the following two questions:

  1. What kind of statistics might we see because of sampling error if the null hypothesis were true?
  2. How unusual would it be to see a statistic as extreme as the observed statistic calculated from our sample if the null hypothesis were true?

Figure: The sampling distribution of \(\bar{x}\) when \(n = 10\)
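These two questions can be explored by simulation. A minimal sketch, assuming (for illustration only) that the null hypothesis specifies a Normal population with \(\mu = 50\) and \(\sigma = 10\):

```r
# Simulate the sampling distribution of xbar when n = 10, under an
# assumed H0 population: Normal with mu = 50, sd = 10 (illustrative)
set.seed(2023)
xbars <- replicate(10000, mean(rnorm(10, mean = 50, sd = 10)))

# Q1: what values of xbar might sampling error alone produce?
quantile(xbars, c(0.025, 0.975))

# Q2: how unusual would an observed xbar of, say, 58 be under H0?
mean(xbars >= 58)
```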

Definition: p-value

The p-value (or P-value) is calculated by assuming that the null hypothesis is true, and then determining the probability of observing a statistic as extreme as, or more extreme than, the observed statistic

Briefly speaking: A probability can be defined as the chance of an “event” according to a probability distribution

Prescription of α

\(\alpha\), also known as the significance level, is the borderline between when a p-value is “small” enough and when it is not “small” enough

The most common choice is \(\alpha = 0.05 ~ (5\%)\)

In any statistical test, e.g. a hypothesis test, the smaller the p-value, the stronger the evidence is against the null hypothesis, and the stronger the evidence is in favour of the alternative hypothesis

Evidence against the null

Evidence     Very strong   Strong         Some     Weak           None
p-value      ≤ 0.01        0.01 to 0.05   ≈ 0.05   0.05 to 0.10   > 0.10

If α = 0.05

Two-sided versus one-sided tests

Recall from Slide 6 that:

  • The null hypotheses were expressed with \(=\) signs
  • Whereas, the alternative hypotheses were expressed with \(\neq\), \(>\), or \(<\) signs

The choice of \(\neq\) (two-sided) versus \(>\) or \(<\) (one-sided) affects the calculation of the p-value in most, if not all, cases

In practice, most hypothesis tests are two-sided because stating the “direction” of the alternative hypothesis is not always clear when we translate a research question into a set of null and alternative hypotheses

A hypothesis test for μ

Also known as the one-sample t-test (for μ)

Assumptions for a hypothesis test for μ

  1. Independent observations—typically met with random samples or randomisation of the data collection order with randomised experiments
  2. Unimodal—one peak
  3. Approximately symmetrical about the sample mean, \(\bar{x}\), and there are no outliers

More on assumption 2

  • In practice, this is to be certain that we can appropriately summarise our data with one measure of centre

Definition: The test statistic for μ

\[ t_0 = \frac{\bar{x} - \mu_0}{\text{se}(\bar{x})} \]

where:

  • \(t_0\) is the T-test statistic (for μ)
  • \(\bar{x}\) is the sample mean
  • \(\mu_0\) is the hypothesised value of the population mean
  • \(\text{se}(\bar{x})\) is the standard error of \(\bar{x}\)—see T03, Slide 19

Briefly: Student’s t-distribution

William S. Gosset, and others, derived the exact probability distribution that models the probability of observing a test statistic within a given interval

When the test statistic is for the population mean, \(\mu\), we use the Student’s t-distribution to calculate the p-value

The mathematical details relevant for us in DATAX121 are that:

  • The Student’s t-distribution can be defined by its degrees of freedom, \(\nu\)
  • It’s the exact model for some sampling distributions of test statistics

Figure: The Student’s t-distribution when \(\nu = 9\) superimposed on top of simulated T-test statistics from a population whose \(\mu = 50\)

Calculation of the p-value (for μ)

Let \(T\) be the Student’s t-distribution with \(\nu = n - 1\)

  • \(\nu\) is the Student’s t-distribution’s degrees of freedom parameter
  • \(n\) is the number of observations in the sample

If it is a two-sided test, e.g. \(H_1 \! : \mu \neq \mu_0\)

\(\quad p\text{-value} = 2 \times \mathbb{P}(T > |t_0|)\)

\(|t_0|\) stands for the absolute value of \(t_0\), which “removes” the sign of a value

For example:

  • \(|-2| = 2\)
  • \(|15| = 15\)

If it is a one-sided test and \(H_1 \! : \mu > \mu_0\)

\(\quad p\text{-value} = \mathbb{P}(T > t_0)\)

If it is a one-sided test and \(H_1 \! : \mu < \mu_0\)

\(\quad p\text{-value} = \mathbb{P}(T < t_0)\)
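All three cases reduce to tail probabilities of the Student’s t-distribution, which R computes with `pt()`. A sketch with an illustrative test statistic and sample size:

```r
t0 <- -0.92  # illustrative test statistic
n  <- 20     # so the degrees of freedom are nu = n - 1 = 19

# Two-sided, H1: mu != mu0
2 * pt(abs(t0), df = n - 1, lower.tail = FALSE)

# One-sided, H1: mu > mu0
pt(t0, df = n - 1, lower.tail = FALSE)

# One-sided, H1: mu < mu0
pt(t0, df = n - 1, lower.tail = TRUE)
```

Note that the two one-sided tail probabilities always sum to 1, since they cover the whole distribution.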

The R function we want to manually replicate

lightspeed.df <- read.csv("datasets/lightspeed.csv")

t.test(pass.time ~ 1, data = lightspeed.df, conf.level = 0.95, mu = 24.8296)

    One Sample t-test

data:  pass.time
t = -0.91633, df = 19, p-value = 0.371
alternative hypothesis: true mean is not equal to 24.8296
95 percent confidence interval:
 24.82615 24.83095
sample estimates:
mean of x 
 24.82855 

CS 2.1 revisited: Replication with light speeds

Recall that the theoretical passage time for Newcomb’s experiment was 24.8296 millionths of a second.

\(H_0\!: \mu = 24.8296\)

\(H_1\!: \mu \neq 24.8296\)

\(p\text{-value} = 2 \times \mathbb{P}(T > |-0.92|)\)

# Calculate and assign several statistics to their own objects
xbar <- mean(lightspeed.df$pass.time)
n <- nrow(lightspeed.df)
se <- sd(lightspeed.df$pass.time) / sqrt(n)
mu0 <- 24.8296

c(xbar, n, se, mu0)
[1] 24.828550000 20.000000000  0.001145874 24.829600000
# Calculate the test statistic for mu
t0 <- (xbar - mu0) / se
t0
[1] -0.9163314
# Calculate the p-value 
2 * pt(abs(t0), df = n - 1, lower.tail = FALSE)
[1] 0.3709779
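The 95% confidence interval reported by `t.test()` can be replicated in the same manual fashion, using the summary statistics printed above:

```r
# Summary statistics from the output above
xbar <- 24.82855
se   <- 0.001145874
n    <- 20

# 95% confidence interval: xbar +/- t* x se, with t* from qt()
tstar <- qt(0.975, df = n - 1)
ci <- c(xbar - tstar * se, xbar + tstar * se)
ci  # (24.82615, 24.83095), matching the t.test() output
```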

On the interpretation of hypothesis tests

Style One

We do not reject that the underlying mean passage time is 24.8296 millionths of a second at the 5% significance level, in favour of the alternative that it is not 24.8296 millionths of a second (p-value = 0.3710).

Style Two

We have no evidence against the underlying mean passage time being 24.8296 millionths of a second, in favour of the alternative that it is not 24.8296 millionths of a second (p-value = 0.3710).

Critical features

  • Quantify the evidence against the null hypothesis
  • Contextualise the null and alternative hypotheses with units (where applicable)
  • Quote the exact p-value

Briefly: What is α really?

You may have noticed that \(\alpha\) is only a borderline (or threshold) value we evaluate the p-value against

It more specifically represents the tolerable probability of making a Type I error. A Type I error is the scenario where we reject the null in favour of the alternative, but it is true in reality

An example of a Type I error is a doctor telling a biological male that “you are pregnant”, when they should have said “you are not pregnant”

CS 1.1 revisited: NZ income snapshot in 2011

Synthetic sample data based on real data from the June quarter 2011 NZ Income Survey. The survey was an annual snapshot to produce income statistics on New Zealanders aged 15 and over based on a representative sample of the population.

Variables

  ethnicity      A factor denoting the ethnicity with 6 levels
  region         A factor denoting the region of residence
  gender         A factor denoting the gender, male or female
  agegp          A factor denoting the five-year age band. Note that the value 65 describes an individual aged 65 or older
  qualification  A factor denoting the highest qualification level with 5 levels
  occupation     A factor denoting the category of the main income source with 10 levels
  hours          A number denoting the weekly hours worked from all wages and salary jobs, excluding self-employment
  income         A number denoting gross weekly income from all sources ($)

CS 1.1 revisited: NZ income snapshot in 2011

nzis.df <- read.csv("datasets/NZIS-CART-SURF-2011.csv")
histogram( ~ income, data = nzis.df, nint = 50, type = "count",
          xlab = "Gross Weekly Income ($)",
          main = "NZer's gross weekly income snapshot in 2011")

Figure: The gross weekly income of 29447 New Zealanders

We want to test whether the population mean gross weekly income, in 2011, was greater than $1000

  • Assumptions:
  • Hypothesis statements
    • \(H_0: \mu = 1000\)
    • \(H_1: \mu > 1000\)
  • \(p\text{-value} = \mathbb{P}(T > t_0)\)

CS 1.1 revisited: NZ income snapshot in 2011


t.test(income ~ 1, data = nzis.df,
       alternative = "greater", mu = 1000)

    One Sample t-test

data:  income
t = -65.552, df = 29446, p-value = 1
alternative hypothesis: true mean is greater than 1000
95 percent confidence interval:
 684.4214      Inf
sample estimates:
mean of x 
 692.1465 
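The p-value of 1 can be recovered directly from the reported test statistic: with \(t_0 = -65.552\), essentially all of the t-distribution’s probability lies above \(t_0\), so there is no evidence at all in the direction of the alternative:

```r
# Recover the one-sided p-value P(T > t0) from the reported output
t0 <- -65.552
n  <- 29447

p <- pt(t0, df = n - 1, lower.tail = FALSE)
p  # numerically 1: the sample mean ($692.15) lies far below $1000
```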

CS 4.1: Are they getting enough sleep?

It is generally recommended that adults sleep at least 8 hours each night. A lecturer recently asked some of her students how many hours each had slept the previous night, curious as to whether her students were getting enough sleep.

The 12 students sampled averaged 6.2 hours of sleep with a standard deviation of 1.7 hours. Assuming that this sample meets the assumptions, does this data provide evidence (at the 5% significance level) that her students, on average, are not getting enough sleep?

You may use the fact that \(t^\ast_{0.975}(11) = 2.20\)

CS 4.1: Are they getting enough sleep?

The exact p-value was 0.0037
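This p-value can be replicated from the summary statistics given in the question (a two-sided test, which matches the quoted value):

```r
# Summary statistics from the case study
xbar <- 6.2  # sample mean hours of sleep
s    <- 1.7  # sample standard deviation
n    <- 12
mu0  <- 8    # hypothesised mean under H0

se <- s / sqrt(n)
t0 <- (xbar - mu0) / se
t0  # about -3.67

pval <- 2 * pt(abs(t0), df = n - 1, lower.tail = FALSE)
pval  # about 0.0037, as quoted above
```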

What about the 95% confidence interval for the population mean hours of sleep?
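A sketch of the interval, using \(t^\ast_{0.975}(11) = 2.20\) from the question:

```r
# 95% confidence interval for the mean hours of sleep
xbar  <- 6.2
s     <- 1.7
n     <- 12
tstar <- 2.20  # t*_0.975(11), as given

se <- s / sqrt(n)
ci <- c(xbar - tstar * se, xbar + tstar * se)
ci  # roughly (5.12, 7.28); 8 hours lies outside the interval
```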


Hypothesis testing & Confidence intervals

Generally speaking, for a two-sided hypothesis test and a confidence interval for the same parameter, the significance level, \(\alpha\), determines both the significance threshold for the p-value and the width of the confidence interval: the test rejects \(H_0\) at level \(\alpha\) exactly when the hypothesised value lies outside the \((1 - \alpha) \times 100\%\) confidence interval
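The link can be checked numerically: a two-sided test at level \(\alpha\) rejects \(H_0\!: \mu = \mu_0\) exactly when \(\mu_0\) lies outside the \((1 - \alpha)\) confidence interval. A sketch with simulated data (the data and \(\mu_0\) below are illustrative):

```r
# Check the test/interval duality on simulated data (illustrative)
set.seed(1)
y <- rnorm(30, mean = 10, sd = 2)

res <- t.test(y, mu = 10.5, conf.level = 0.95)

rejects <- res$p.value < 0.05
outside <- 10.5 < res$conf.int[1] || 10.5 > res$conf.int[2]
c(rejects = rejects, outside = outside)  # the two always agree
```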

What about a one-sided hypothesis test?

Lies, damn lies, and statistics

On the topic of a p-value

  • p-values are often misinterpreted
  • It is easier to find significant results with larger sample sizes
  • If we collected new data, would we also get a “small” p-value?
  • Could we also arbitrarily “hack” the p-value?